Generalization and Scaling in Reinforcement Learning

Authors

  • David H. Ackley
  • Michael L. Littman
Abstract

In associative reinforcement learning, an environment generates input vectors, a learning system generates possible output vectors, and a reinforcement function computes feedback signals from the input-output pairs. The task is to discover and remember input-output pairs that generate rewards. Especially difficult cases occur when rewards are rare, since the expected time for any algorithm can grow exponentially with the size of the problem. Nonetheless, if a reinforcement function possesses regularities, and a learning algorithm exploits them, learning time can be reduced below that of non-generalizing algorithms. This paper describes a neural network algorithm called complementary reinforcement back-propagation (CRBP), and reports simulation results on problems designed to offer differing opportunities for generalization.

1 REINFORCEMENT LEARNING REQUIRES SEARCH

Reinforcement learning (Sutton, 1984; Barto & Anandan, 1985; Ackley, 1988; Allen, 1989) requires more from a learner than does the more familiar supervised learning paradigm. Supervised learning supplies the correct answers to the learner, whereas reinforcement learning requires the learner to discover the correct outputs before they can be stored. The reinforcement paradigm divides neatly into search and learning aspects: when rewarded, the system makes internal adjustments to learn the discovered input-output pair; when punished, the system makes internal adjustments to search elsewhere.

1.1 MAKING REINFORCEMENT INTO ERROR

Following work by Anderson (1986) and Williams (1988), we extend the backpropagation algorithm to associative reinforcement learning. Start with a "garden variety" backpropagation network: a vector i of n binary input units propagates through zero or more layers of hidden units, ultimately reaching a vector s of m sigmoid units, each taking continuous values in the range (0, 1). Interpret each s_j as the probability that an associated random bit o_j takes on value 1. Let us call the continuous, deterministic vector s the search vector to distinguish it from the stochastic binary output vector o. Given an input vector, we forward propagate to produce a search vector s, and then perform m independent Bernoulli trials to produce an output vector o. The i-o pair is evaluated by the reinforcement function and reward or punishment ensues.

Suppose reward occurs. We therefore want to make o more likely given i. Backpropagation will do just that if we take o as the desired target, producing an error vector (o - s), and adjust the weights normally. Now suppose punishment occurs, indicating that o does not correspond with i. By choice of error vector, backpropagation allows us to push the search vector in any direction; which way should we go? In the absence of problem-specific information, we cannot pick an appropriate direction with certainty. Any decision will involve assumptions. A very minimal "don't be like o" assumption (employed in Anderson, 1986; Williams, 1988; and Ackley, 1989) pushes s directly away from o by taking (s - o) as the error vector. A slightly stronger "be like not-o" assumption (employed in Barto & Anandan, 1985, and Ackley, 1987) pushes s directly toward the complement of o by taking ((1 - o) - s) as the error vector. Although the two approaches always agree on the signs of the error terms, they differ in magnitudes. In this work we explore the second possibility, embodied in an algorithm called complementary reinforcement back-propagation (CRBP).
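To make the two cases just described concrete, here is a minimal numpy sketch (not from the paper) of one search step and of the complementary error-vector choice. The single-layer network, the helper names, and the tiny dimensions are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(W, b, i_vec):
    """Forward propagate a binary input to the continuous search vector s (sigmoid units)."""
    return 1.0 / (1.0 + np.exp(-(W @ i_vec + b)))

def sample_output(s):
    """m independent Bernoulli trials: bit o_j is 1 with probability s_j."""
    return (rng.random(s.shape) < s).astype(float)

def error_vector(s, o, rewarded):
    """Reward: pull s toward o, error (o - s).
    Punishment ("be like not-o"): pull s toward the complement, error ((1 - o) - s).
    The weaker "don't be like o" rule would instead use (s - o)."""
    target = o if rewarded else 1.0 - o
    return target - s

# Illustrative dimensions: n = 4 input bits, m = 3 output bits.
n, m = 4, 3
W, b = rng.normal(scale=0.1, size=(m, n)), np.zeros(m)
i_vec = rng.integers(0, 2, size=n).astype(float)
s = forward(W, b, i_vec)
o = sample_output(s)
e = error_vector(s, o, rewarded=False)  # punished: push s toward 1 - o
```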
Figure 1 summarizes the CRBP algorithm. The algorithm in the figure reflects three modifications to the basic approach just sketched. First, in step 2, instead of using the s_j's directly as probabilities, we found it advantageous to "stretch" the values using a parameter ν. When ν < 1, it is not necessary for the s_j's to reach zero or one to produce a deterministic output. Second, in step 6, we found it important to use a smaller learning rate for punishment than for reward. Third, consider step 7: another forward propagation is performed, another stochastic binary output vector o* is generated (using the procedure from step 2), and o* is compared to o. If they are identical and punishment occurred, or if they are different and reward occurred, then another error vector is generated and another weight update is performed. This loop continues until a different output is generated (in the case of failure) or until the original output is regenerated (in the case of success). This modification improved performance significantly, and added only a small percentage to the total number of weight updates performed.

0. Build a backpropagation network with input dimensionality n and output dimensionality m. Let t = 0 and t_e = 0.
1. Pick a random i ∈ 2^n and forward propagate to produce the s_j's.
2. Generate a binary output vector o. Given uniform random variables ξ_j ∈ [0, 1] and a parameter 0 < ν ≤ 1, let o_j = 1 if (s_j - 1/2)/ν + 1/2 ≥ ξ_j, and o_j = 0 otherwise.
3. Compute reinforcement r = f(i, o). Increment t. If r < 0, let t_e = t.
4. Generate output errors e_j: if r > 0, let t_j = o_j, otherwise let t_j = 1 - o_j. Let e_j = (t_j - s_j) s_j (1 - s_j).
5. Backpropagate the errors.
6. Update the weights: Δw_jk = η e_k s_j, using η = η+ if r ≥ 0 and η = η- otherwise, with parameters η+, η- > 0.
7. Forward propagate again to produce new s_j's, and generate a temporary output vector o*. If (r > 0 and o* ≠ o) or (r < 0 and o* = o), go to 4.
8. If t_e ≪ t, exit returning t_e; otherwise go to 1.

Figure 1: Complementary Reinforcement Back-Propagation (CRBP)
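The loop of Figure 1 can be sketched as follows, again as an approximation rather than the paper's implementation: the network here has no hidden layer (so step 5's backpropagation collapses into the delta rule of step 6), the reinforcement function f is supplied by the caller, and ν, η+, η-, and the quiet_steps stand-in for the "t_e ≪ t" exit test are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

def crbp(f, n, m, nu=0.5, eta_pos=0.3, eta_neg=0.1,
         max_steps=100_000, quiet_steps=1_000):
    """Sketch of the Figure 1 loop for a single-layer network (weights W, biases b)."""
    W = rng.normal(scale=0.1, size=(m, n))
    b = np.zeros(m)
    t, t_e = 0, 0

    def search(i_vec):
        s = 1.0 / (1.0 + np.exp(-(W @ i_vec + b)))                 # forward propagate
        o = ((s - 0.5) / nu + 0.5 >= rng.random(m)).astype(float)  # step 2: "stretched" Bernoulli trials
        return s, o

    while t < max_steps:
        i_vec = rng.integers(0, 2, size=n).astype(float)           # step 1: random input
        s, o = search(i_vec)
        r = f(i_vec, o)                                            # step 3: reinforcement
        t += 1
        if r < 0:
            t_e = t
        while True:
            target = o if r > 0 else 1.0 - o                       # step 4: complementary target
            e = (target - s) * s * (1.0 - s)
            eta = eta_pos if r >= 0 else eta_neg                   # step 6: asymmetric learning rates
            W += eta * np.outer(e, i_vec)                          # steps 5-6 collapse to the delta rule
            b += eta * e
            s, o_star = search(i_vec)                              # step 7: forward propagate and re-sample
            if (r > 0) == np.array_equal(o_star, o):               # stop once o* agrees with the outcome
                break
        if t - t_e > quiet_steps:                                  # step 8: long run without punishment
            return t_e
    return t_e
```

With a hidden layer (the CRBP n-n-n configuration reported below), step 5 would backpropagate e through the hidden weights before the updates in step 6.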
2 ON-LINE GENERALIZATION

When there are many possible outputs and correct pairings are rare, the computational cost associated with the search for the correct answers can be profound. The search for correct pairings will be accelerated if the search strategy can effectively generalize the reinforcement received on one input to others. The speed of an algorithm on a given problem, relative to non-generalizing algorithms, provides a measure of generalization that we call on-line generalization. Figure 2 shows the non-generalizing, table-based reference algorithm, Trer.

0. Let z be an array of length 2^n. Set the z[i] to random numbers from 0 to 2^m - 1. Let t = t_e = 0.
1. Pick a random input i ∈ 2^n.
2. Compute reinforcement r = f(i, z[i]). Increment t.
3. If r < 0, let z[i] = (z[i] + 1) mod 2^m, and let t_e = t.
4. If t_e ≪ t, exit returning t_e; otherwise go to 1.

Figure 2: Trer, a table-based non-generalizing search algorithm

3.1 n-MAJORITY

As a first example, consider the n-majority problem: [if a majority of the input bits are 1, then o = 1^n, else o = 0^n]. The i-o mapping is many-to-one. This problem provides an opportunity for what Anderson (1986) called "output generalization": since there are only two correct output states, every pair of output bits is completely correlated in the cases when reward occurs.

Figure 3: The n-majority problem (time on a log scale vs. n; legend: × Table, □ CRBP n-n-n, + CRBP n-n)

Figure 3 displays the simulation results.¹ Note that although Trer is faster than CRBP at small values of n, CRBP's slower growth rate (1.6^n vs. 4.2^n) allows it to cross over and begin outperforming Trer at about 6 bits. Note also, in violation of some conventional wisdom, that although n-majority is a linearly separable problem, the performance of CRBP with hidden units is better than without. Hidden units can be helpful, even on linearly separable problems, when there are opportunities for output generalization.

¹ For n = 1 to 12, we used η+ = {2.000, 1.550, 1.130, 0.979, 0.783, 0.709, 0.623, 0.525, 0.280, 0.219, 0.170, 0.121}.

3.2 n-COPY AND THE 2^k-ATTRACTORS FAMILY

As a second example, consider the n-copy problem: [o = i]. The i-o mapping is now one-to-one, and the values of the output bits in rewarding states are completely uncorrelated with each other, but the value of each output bit is completely correlated with the value of the corresponding input bit.

[Figure 4: time on a log scale vs. n; reference curve 150·2.0^n]

Figure 4 displays the simulation results. Once again, at ...
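For reference, the table-based procedure of Figure 2 above admits an equally short sketch. The integer encoding of inputs and outputs and the quiet_steps exit test are illustrative assumptions, as before.

```python
import numpy as np

rng = np.random.default_rng(2)

def trer(f, n, m, max_steps=1_000_000, quiet_steps=1_000):
    """Sketch of the Figure 2 table-based search: one output code per input, rotated on punishment."""
    z = rng.integers(0, 2 ** m, size=2 ** n)          # step 0: random output code for every input
    t, t_e = 0, 0

    def bits(x, width):
        return np.array([(x >> k) & 1 for k in range(width)], dtype=float)

    while t < max_steps:
        i = int(rng.integers(0, 2 ** n))              # step 1: random input, kept as an integer index
        r = f(bits(i, n), bits(int(z[i]), m))         # step 2: reinforcement
        t += 1
        if r < 0:                                     # step 3: punished, so try the next output code
            z[i] = (z[i] + 1) % (2 ** m)
            t_e = t
        if t - t_e > quiet_steps:                     # step 4: long run without punishment
            return t_e
    return t_e
```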
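The two reinforcement functions discussed above can also be written directly. The ±1 return values are an assumed sign convention, chosen only to be consistent with the r < 0 tests in Figures 1 and 2, and the tie-breaking for even n in n-majority is a guess.

```python
import numpy as np

def n_majority(i_bits, o_bits):
    """Reward iff o is all ones when most input bits are 1, and all zeros otherwise."""
    want = np.ones_like(o_bits) if i_bits.sum() * 2 > len(i_bits) else np.zeros_like(o_bits)
    return 1.0 if np.array_equal(o_bits, want) else -1.0

def n_copy(i_bits, o_bits):
    """Reward iff the output exactly copies the input (o = i)."""
    return 1.0 if np.array_equal(o_bits, i_bits) else -1.0

# Example, using the crbp and trer sketches above:
#   trer(n_majority, 6, 6)   vs.   crbp(n_majority, 6, 6)
```

Under n-majority every output bit must agree in rewarded states, while under n-copy each output bit tracks exactly one input bit; these are the regularities that a generalizing learner such as CRBP can exploit and a table-based learner cannot.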




Publication date: 1989